While my example is graphical, only the SDL frame buffer capability of the GPU is being used. All object rendering computations are done by my CUDA code running on the GPU -- the card doesn't know this is a graphical computation.
The NVIDIA GPU has multiple multiprocessors: 12 on the GeForce 8800 GTS I tested. Each multiprocessor has 8 processors, and with them each multiprocessor can support up to 768 concurrent threads.
Under CUDA the programmer writes a kernel which is the code executed by a single thread. Threads are placed in two or three-dimensional groupings called blocks, and blocks in turn are placed in two-dimensional groupings called grids. The run-time system schedules blocks and grids to run as resources are available. The general computational model is SIMD: the same kernel instructions are executed by all threads. The NVIDIA C compiler supports fairly standard C with branches, loops, etc., but no recursion.
Like the SPUs on the PS3 Cell processor, GPU memory is limited and access to host memory is slow. Data is explicitly moved from the host to the device by the host program. Here data is in device global memory (shared between all multiprocessors), which is still quite slow. The kernel threads can copy data from global memory to shared memory (shared between processors in a single multiprocessor), though shared memory is limited to 16K per multiprocessor.
With this design CUDA is not a practical way to implement a ray-tracer. The 16K shared memory restriction makes realistic world models impractical. It appears the number of threads which can be started simultaneously is constrained by the size of the world state structure:
Here we see that with a small number of objects in the world the program runs faster with shared memory. With a large number of objects in the world it is actually faster without shared memory (and shared memory cannot fit more than 225 objects). The Shared memory speedup is independent of whether all the objects are actually rendered. The program bottleneck is not in looping over the objects, but in the speed of access to object data and the number of threads which can be simultaneously started.
Below shows CPU and memory utilization during NVIDIA CUDA GPGPU, generic 1 pthread, and generic 4 pthreads tests. In the CUDA test my ray program used 92% of one CPU and the Xorg X-windows server used 12% of another CPU.
int screenX;
#ifdef USE_OMP
#pragma omp parallel for firstprivate(portPoint)
#endif
for (screenX = 0;
screenX < SCREEN_WIDTH;
screenX++) {
With this change each pixel is rendered by a separate thread, with a maximum
of 4 (the number of cores) running simultaneously. This use of threads is
an alternative to the explicit pthread threading tested above, where a
separate thread
is used per line of the screen. In this case OpenMP is simpler than explicit
threading, though it yields slightly lower performance.
GCC 4.2 also supports OpenMP, but its performance is much worse than Intel.